1 Introduction

The concept of k-anonymity, used in the recent literature (e.g., [10, 11, 7, 5, 1]) to
formally evaluate the privacy preservation of published tables, was introduced
in the seminal papers of Samarati and Sweeney [10, 11] based on the notion
of quasi-identifiers (or QI for short). The process of obtaining k-anonymity
for a given private table is first to recognize the QIs in the table, and then
to anonymize the QI values, the latter being called k-anonymization. While
k-anonymization is usually rigorously validated by the authors, the definition
of QI remains mostly informal, and different authors seem to have different
interpretations of the concept of QI.

The purpose of this short note is to provide a formal underpinning of QI 
and examine the correctness and incorrectness of various interpretations of QI 
in our formal framework. We observe that in cases where the concept has been 
used correctly, its application has been conservative; this note provides a formal 
understanding of the conservative nature in such cases. 

The notion of QI was perhaps first introduced by Dalenius in [3] to denote a
set of attribute values in census records that may be used to re-identify a single
individual or a group of individuals. To Dalenius, the case of multiple individuals
being identified is potentially dangerous because of collusion. In [10, 11], the
notion of QI is extended to a set of attributes whose (combined) values may be
used to re-identify the individuals of the released information by using "external"
sources. Hence, the appearance of QI attribute values in a published database
table may give out private information and must be carefully controlled. One
way to achieve this control is by anonymizing QI attribute values through a
k-anonymization process.

The k-anonymization process, as first defined in [10, 11], amounts to
generalizing the values of the QI in the table so that the set of individuals who
have the same generalized QI attribute value combination forms an anonymity set
of size no less than k, following the pioneering work on anonymity sets by Chaum
[5]. (According to a later proposal for terminology [3], "Anonymity is the state
of being not identifiable within a set of subjects, the anonymity set.") The
resulting table is said to be k-anonymous.

The notion of QI is hence fundamental for k-anonymity. However, in [10],
QI is only informally described. The paper seems to assume that all attributes
that may be available from external sources should be part of QI. Recent papers
(e.g., [7, 1]) appear to use a similar informal definition, but with a variety of
interpretations (see below).

The only formal definition of QI that we found in the literature appears in
[11]. The definition is rather complicated, but from that definition, we
understand that a set of attributes $Q_T$ in a table T is a QI if there exists a
specific individual $r_i$ such that, based only on the combination of the values
for $Q_T$ associated with $r_i$, it is possible to re-identify that specific,
single individual.

It emerges from the above formal definition that what really characterizes a
QI is the ability to associate a combination of its values with a single
individual. The same notion seems to be captured by Def. 2.1 of [6]. We shall call
the QI defined this way 1-QI (the number 1 intuitively indicates the number of
individuals identified by the QI). This formal definition seems to deviate from
the original idea of Dalenius [3], which gave importance to the identification of
groups of individuals. Although Dalenius was only concerned about collusion,
the identification of groups of individuals is closely related to the anonymity set
concept and should not be ignored. This deviation actually leads to
incorrectness, as we shall show in this short note.

Many studies on k-anonymization have since appeared in the literature.
However, different authors seem to interpret the concept of QI differently. In
addition to the original interpretation of QI as (1) the set of all the attributes
that appear in external sources [10], and (2) a set of attributes that we call 1-QI
[11], we found the following uses of QI in k-anonymization: (3) use the minimum
QI, i.e., the minimum set of attributes that can be used to re-identify
individuals [5, 4], and (4) anonymize the multiple minimum QIs in the same table
[12], since the minimum QI is found not to be unique.

Through a formal study of the notion of QI, we conclude in this short note
that the use of QI as in category (1) is correct but conservative, while the use of
QI as in the other three categories is incorrect. Hence, the contribution of this
short note is: (a) the concept of QI and its role in k-anonymity are clarified,
and (b) the conservative nature of the techniques in the recent papers is better
understood. Point (b) above can further lead to (c) new possibilities for more
focused data anonymization to avoid over-conservativeness.

The remainder of this short paper is organized as follows. Section 2 gives
some preliminary definitions. Section 3 introduces a new formalization of
k-anonymity, and Section 4 defines the notion of QIs and links the QI with
k-anonymity. Section 5 shows that k-anonymity using QI other than all the external
attributes is problematic, and Section 6 formalizes in our framework the
conservative assumption currently used for k-anonymization and provides a proof
that the approach is sufficient but not necessary. Section 7 concludes the paper.

2 Preliminary definitions 

Following the convention of the k-anonymity literature, we assume a relational
model with the bag semantics (or multiset semantics). We assume the standard 
bag-semantic definitions of database relation/table, attribute and tuple, as well 
as the standard bag-semantic definitions of the relational algebra operations. 
In particular, under the bag semantics, relations allow duplicate tuples and 
operations keep duplicates [S]. 

We shall use T (possibly with subscripts) to denote relational tables, t (pos- 
sibly with subscripts) to denote tuples in tables, and Attr[T] to denote the 
attribute set of table T. We shall also use A and B (possibly with subscripts) 
to denote both sets of attributes and single attributes as the difference will be 
clear from the context. 

To prevent private information from leaking, the k-anonymization approach
is to generalize the values in a table. For example, both ZIP codes "22033"
and "22035" may be generalized to the value "2203*", an interval value
[22000-22099], or a general concept value "Fairfax, Virginia". The idea is that
each "generalized" value corresponds to a set of "specific" values, and the user of
the table can only tell from the general value that the original value is one of
the specific values in the set.

The set of specific values that corresponds to a general value can be formally
specified with a decoding function. This decoding function, denoted Dec(),
maps a value to a non-empty set of values. The domain of Dec() is said to be
the general values, denoted $D_G$, and the range of Dec() is the non-empty subsets
of the specific values, denoted $D_S$. As such, all attributes in our relational
tables will use the same domain, either $D_G$ (for generalized tables) or $D_S$
(for specific tables). We assume that $D_S$ is a subset of $D_G$ and that the
decoding of a $D_S$ value is the set consisting of the value itself. In addition,
we assume that the decoding function is publicly known and hence all the privacy
protection comes from the uncertainty provided by the set of values decoded from a
single one.

The decoding function is trivially extended to tuples, by decoding each of the
attribute values in a tuple. More specifically, given a tuple t with generalized
values on attributes $A_1, \ldots, A_n$, Dec(t) gives the set of tuples
$Dec(t[A_1]) \times \cdots \times Dec(t[A_n])$, i.e., the cross product of the
decoding of each attribute. In other words, the decoding of a tuple t gives rise to
the set of all specific tuples that would be generalized to t. The decoding
function is similarly extended to tables, yielding a set Dec(T) of tables from a
given T. Specifically, given a table $T = t_1, \ldots, t_n$, a table
$T' = t'_1, \ldots, t'_n$ is in Dec(T) if $t'_i$ is in $Dec(t_i)$ for each
$i = 1, \ldots, n$.
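The decoding function and its extension to tuples and tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decoding table `DEC` and the helper names `dec_value`, `dec_tuple` and `in_dec_table` are hypothetical, and the toy domain contains only the two ZIP codes and three first names shown.

```python
from itertools import product

# Hypothetical decoding table: generalized value -> set of specific values.
# Any value not listed is treated as specific and decodes to itself.
DEC = {
    "2203*": {"22033", "22035"},
    "J*": {"John", "Jeanne", "Jane"},
}

def dec_value(v):
    """Decode one value into a non-empty set of specific values."""
    return DEC.get(v, {v})

def dec_tuple(t):
    """Decode a tuple as the cross product of its decoded attribute values."""
    return set(product(*(dec_value(v) for v in t)))

def in_dec_table(specific_rows, general_rows):
    """True if specific_rows is in Dec(T): row i must lie in Dec of row i of T."""
    return len(specific_rows) == len(general_rows) and all(
        s in dec_tuple(g) for s, g in zip(specific_rows, general_rows)
    )
```

For instance, under these assumptions `dec_tuple(("J*", "2203*"))` contains six specific tuples, one for each pairing of a decoded name with a decoded ZIP code.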

In the k-anonymization literature, tables may be generalized using a local
encoding or global encoding [5]. (Encoding refers to the process of obtaining the
general value from a specific one.) The difference is that in global encoding, the
different appearances of a specific value are generalized to the same generalized
value, while in local encoding, they may be generalized to different generalized
values. The formalization with the Dec() function is oblivious to this difference,
and is correct in the sense that with either approach, the original table is in
Dec(T). The Dec() approach is justified as we are not concerned in this short
note with specific anonymization techniques.

3 The world and k-anonymity

In this section, we formally define the notion of k-anonymity without using
QIs. We will introduce the QI concept in the next section. The approach is
in contrast to defining k-anonymity based on the concept of QI as traditionally
done. We note that our approach is a logical one, since only when we can define
k-anonymity independently of QI can we prove the correctness of a particular
definition of QI.

The world 

To start with, we model all the external sources that can be used to re-identify
individuals as a world. A world W conceptually is a blackbox that uses attribute
values for re-identification. That is, given a tuple t on some of the attributes
of W, the world W will give back the set of individuals that have the attribute
values given by t. Formally,

Definition 1. A world W is a pair $(Attr[W], ReID_W)$, where Attr[W] is a set
of attributes, and $ReID_W$ is a function that maps the tuples on the schemas that
are non-empty subsets of Attr[W], with domain values from $D_S$, to the finite
sets of individuals.

In other words, given a relation schema $R \subseteq Attr[W]$ and a tuple t on R
with values from $D_S$, $ReID_W(t)$ gives the set of individuals that possess the
attribute values given in t. We say that an individual in $ReID_W(t)$ is an
individual re-identified with t by W, or simply re-identified with t when W is
understood. In this case, we may also say that tuple t re-identifies the individual.

Since the $ReID_W$ function re-identifies individuals with their attribute
values, one property we call "supertuple inclusion" should hold. For example, if
a person is in the set P of individuals re-identified with ZIP code 22032
together with gender male, then this person should be in the set P' re-identified
with ZIP code 22032 alone, i.e., $P \subseteq P'$. On the other hand, if a person
is in P', then there must be a value of gender (either male or female) such that
the person is re-identified with ZIP code 22032 and that gender value together.
More generally, the supertuple inclusion property means that if we add
more attributes to a tuple t, resulting in a "supertuple", then the set of
individuals re-identified will be a subset of those re-identified with t, and at
the same time, each individual re-identified with t will be re-identified with a
particular supertuple of t. Formally, we have:

Definition 2. A world $W = (Attr[W], ReID_W)$ is said to satisfy the supertuple
inclusion property if for each tuple t on attribute set $A \subseteq Attr[W]$ and
each attribute set B, with $A \subseteq B \subseteq Attr[W]$, there exist a finite
number of tuples $t_1, \ldots, t_q$ on B such that (1) $t_i[A] = t$ for each
$i = 1, \ldots, q$, and (2) $ReID_W(t) = ReID_W(t_1) \cup \cdots \cup ReID_W(t_q)$.

In the sequel, we shall assume all the worlds satisfy the supertuple inclusion 
property. 

We also assume that, in the sequel, each world we consider is a closed world, 
in which all the relevant individuals are included. That is, the set of individuals 
identified by ReIDw{t) consists of all the individuals who have the attribute 
values given by t. We shall further motivate this assumption at the end of
this section.

A world is called a finite world if $ReID_W$ maps only a finite number of tuples
to non-empty sets. In the sequel, we assume all worlds are finite worlds.

In summary, we assume in the sequel that all the worlds (1) satisfy the supertuple
inclusion property, (2) are closed, and (3) are finite.

The function $ReID_W$ in a world W is naturally extended to a set of tuples.

The above conceptual, blackbox worlds may be concretely represented as
finite relations. In particular, a world $W = (Attr[W], ReID_W)$ can be
represented as a relation W on Attr[W] with domain $D_S$, under the condition that
W includes attributes, such as SSN, that directly point to an individual. In
this case, the function $ReID_W$ is simply a selection followed by a projection.
For example, if SSN is the attribute that identifies individuals, then $ReID_W(t)$
is defined as $\pi_{SSN}(\sigma_{R=t}(W))$, where R is the schema for tuple t.

In this relational view of W, table W may be considered as a universal 
relation storing for each individual all the associated data that are publicly 
known. As in previous work on this topic, for the sake of simplicity, we also 
assume that the information of one individual is contained in at most one tuple 
of W. (We will explain in Section [7] how this assumption can be avoided.) We 
also assume that one tuple of W contains information of only one individual. 
Furthermore, we assume there is a public method that links a tuple of W with 
the individual that the tuple corresponds to. This public method may be as 
simple as an attribute, such as the social security number, in W that directly 
points to a particular individual. 

For example, W may contain the attributes SSN, Name, Birth Date, Gender,
Address, Voting record, etc. Each tuple of W corresponds to the one individual
pointed to by the SSN. The other attributes give the other property values of the
individual.

Note that the supertuple inclusion property is automatically satisfied by any 
relational world. 
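The relational view of a world can be illustrated with a small Python sketch. It assumes a hypothetical four-person world with attributes SSN, Gender and ZIP; the function names `reid` and `supertuple_inclusion_holds` are illustrative only. Re-identification is written as a selection on the values of t followed by a projection on SSN, and the second function checks the supertuple inclusion property for one tuple and one superset of its attributes.

```python
# Hypothetical toy world; SSN is the attribute that points to an individual.
W = [
    {"SSN": "001", "Gender": "male", "ZIP": "22032"},
    {"SSN": "002", "Gender": "female", "ZIP": "22032"},
    {"SSN": "003", "Gender": "male", "ZIP": "22032"},
    {"SSN": "004", "Gender": "female", "ZIP": "22033"},
]

def reid(world, t, id_attr="SSN"):
    """Selection on the attribute values in t, then projection on id_attr."""
    return {row[id_attr] for row in world
            if all(row[a] == v for a, v in t.items())}

def supertuple_inclusion_holds(world, t, b_attrs):
    """Check the property for tuple t and a superset b_attrs of t's attributes."""
    matching = [row for row in world
                if all(row[a] == v for a, v in t.items())]
    # The supertuples t_1..t_q: distinct extensions of t to b_attrs found in world.
    supertuples = {tuple((a, row[a]) for a in b_attrs) for row in matching}
    union = set()
    for st in supertuples:
        union |= reid(world, dict(st))
    return union == reid(world, t)

# P: ZIP 22032 together with gender male; P': ZIP 22032 alone.
p = reid(W, {"ZIP": "22032", "Gender": "male"})
p_prime = reid(W, {"ZIP": "22032"})
```

In this toy world P is a subset of P', and the individuals in P' are exactly those re-identified by the two supertuples (22032, male) and (22032, female).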
k-anonymity

In our environment, to provide privacy in a published table is to prevent any
attacker from using the world W to re-identify the individuals in the published
table. The notion of k-anonymity is stronger: it prevents any attacker from using
the world W to narrow an individual down to a group of fewer than k individuals.
This intuition is captured more formally in Definition 4 below.

In order to simplify notation, in the following we use PAttr[T] to denote the
public attributes of T, formally defined as $Attr[W] \cap Attr[T]$ when the world W
is understood.

In the above discussion, the case of 0 individuals being re-identified is a
special case. If a tuple in $\pi_{PAttr[T]}(T)$ does not re-identify anyone by
W, this would actually be a mistake, since T is supposed to represent information
of some individuals and the world is assumed to be closed. If this 0-individual
case happens, it must mean that the closed world we have is not "consistent"
with the table in our hand. This observation leads to the following:

Definition 3. Given a table T, a world W is said to be consistent with T if
$|\bigcup_{t' \in Dec(t)} ReID_W(t')| > 0$ for each tuple t in $\pi_{PAttr[T]}(T)$.

A consistent world for a table T is one that can re-identify all the indi- 
viduals whose information is represented in T. In the sequel, we assume the 
world is consistent with the tables under discussion. We provide motivation 
for this assumption at the end of this section when we discuss the closed world 
assumption. 

We are now ready to define k-anonymity.

Definition 4. Let $k \geq 2$ be an integer, W a world, and T a table with
$PAttr[T] \neq \emptyset$. Then T is said to be k-anonymous with respect to W if
for each tuple t in $\pi_{PAttr[T]}(T)$, we have
$|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq k$.

In the above definition, the Dec() function is implicitly assumed to be public
knowledge, and $\bigcup$ is the set union that removes duplicates. Intuitively, the
definition says that T is k-anonymous if for each tuple t in $\pi_{PAttr[T]}(T)$,
we can find at least k individuals from W having values for attributes PAttr[T] as
given by Dec(t).

We note that since external information is considered for re-identification,
k-anonymity should be formally defined with respect to that information, and
not simply on the original private table (which may be conservatively considered
as a special case, as explained in Section 6), as done in most previous work.

As an example, assume the table in Figure 1(a) is the world, in which the
ID attribute is one that directly connects to actual individuals. Table T in
Figure 1(b) is 2-anonymous since each tuple (giving either 20033 or 20034 as
the ZIP code value) will re-identify two individuals through W. For table T',
the decoding function will map the name J* to the set of all names that start with
J. Hence, the first and the second tuples of T' will each re-identify four
individuals while the third tuple of T' will re-identify two individuals.
Therefore, table T' is also 2-anonymous.
ID    FirstName   ZIP
Id1   John        20033
Id2   Jeanne      20034
Id3   Jane        20033
Id4   Jane        20034

(a) The world W.

ZIP     Disease
20033   D1
20033   D2
20034   D3

(b) A table T.

FirstName   Bonus
J*          $10K
J*          $100K
Jane        $20K

(c) Another table T'.

Figure 1: The world W and two published tables.
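The k-anonymity check of Definition 4 on the data of Figure 1 can be sketched as follows. This is a minimal illustration, not an algorithm proposed in this note: the names `DEC`, `dec_value`, `reid` and `anonymity_level` are hypothetical, and the decoding of J* is restricted to the three first names that occur in W.

```python
from itertools import product

# The world W and the tables T and T' of Figure 1.
W = [
    {"ID": "Id1", "FirstName": "John", "ZIP": "20033"},
    {"ID": "Id2", "FirstName": "Jeanne", "ZIP": "20034"},
    {"ID": "Id3", "FirstName": "Jane", "ZIP": "20033"},
    {"ID": "Id4", "FirstName": "Jane", "ZIP": "20034"},
]
T = [
    {"ZIP": "20033", "Disease": "D1"},
    {"ZIP": "20033", "Disease": "D2"},
    {"ZIP": "20034", "Disease": "D3"},
]
T_PRIME = [
    {"FirstName": "J*", "Bonus": "$10K"},
    {"FirstName": "J*", "Bonus": "$100K"},
    {"FirstName": "Jane", "Bonus": "$20K"},
]

# Assumed decoding: J* stands for every first name in W starting with J.
DEC = {"J*": {"John", "Jeanne", "Jane"}}

def dec_value(v):
    """Specific values decode to themselves."""
    return DEC.get(v, {v})

def reid(world, t, id_attr="ID"):
    """Individuals of the world whose attribute values match tuple t."""
    return {row[id_attr] for row in world
            if all(row[a] == v for a, v in t.items())}

def anonymity_level(table, world):
    """Largest k for which the table is k-anonymous with respect to the world."""
    public_attrs = sorted(set(world[0]) & set(table[0]))  # PAttr[T]
    level = None
    for row in table:
        people = set()
        # Union of ReID_W(t') over all t' in Dec(t), t being the public part of row.
        for combo in product(*(dec_value(row[a]) for a in public_attrs)):
            people |= reid(world, dict(zip(public_attrs, combo)))
        level = len(people) if level is None else min(level, len(people))
    return level
```

Under these assumptions both `anonymity_level(T, W)` and `anonymity_level(T_PRIME, W)` evaluate to 2, in agreement with the discussion of Figure 1.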



Anonymity and uncertainty 

As mentioned earlier, the notion of k-anonymity provides protection by forming
an anonymity set of size k. This notion should not be confused with protection
using uncertainty in terms of private values. For example, for T' in Figure 1(c),
even if there is only one Bonus value for Jane (hence there is no uncertainty in
terms of private values), since there are two Janes in the world and attackers
will not be sure which Jane gets the bonus, we still obtain 2-anonymity
for Jane. Hence, uncertainty of private values is not a necessary condition for
protecting privacy.

However, in other situations, uncertainty is required. For example, take T
in Figure 1(b) and assume that the public knows that the first two tuples are
for two different individuals. (In most of the k-anonymity literature, different
tuples in T are assumed to be for different individuals.) Now if the second
tuple had disease D1 instead of D2, there would not be enough protection via
anonymity since there are only two individuals in ZIP 20033 in the world W.
Indeed, in that case, both individuals Id1 and Id3 would have the same disease.
The notions of l-diversity [7] and (α, k)-anonymity [13] are provided to solve
this problem.

Two observations arise from the example in the previous paragraph. The first is
that if we do not assume that the public knows the two tuples are for two different
individuals, then there is no privacy leak, since there is no way of telling
whether either of the two individuals has the disease (it could be the same
individual diagnosed with the same disease twice). The second is that even if the
public knows that the two tuples are indeed for two different individuals,
2-anonymity is still maintained for each tuple since there is no way of knowing
which of the two individuals the tuple corresponds to. The privacy leak is due to a
lack of uncertainty. The reader is referred to [2] for more discussion of
uncertainty and anonymity (called indistinguishability in [14]). In this short
note, we limit our discussion to k-anonymity.
Practical considerations

As observed in [10, 11], in practice it is very difficult to check k-anonymity on
external sources, mainly due to the difficulty, if not impossibility, of knowing
the closed world W that represents the complete knowledge of the external
world. Indeed, it is not what we are proposing to do in this short note from an
algorithmic point of view. Instead, we use this formal definition to clarify the
role of quasi-identifiers, to give a precise semantics to k-anonymity, and to study
the conservative nature of generalization algorithms reported in the literature.

On the other hand, from a practical point of view, it is possible that some
global constraints exist on the world, and that they could be exploited by
k-anonymization algorithms. For example, if we know from census data that each
combination of (ZIP, Gender) values always corresponds to no fewer than 500
individuals, any table $T_S$ with $PAttr[T_S] \subseteq$ (ZIP, Gender) is
automatically k-anonymous for any $k \leq 500$.
Further investigation of such a technique is beyond the scope of this short note.

More on the closed world assumption 

The idea of the closed world assumption is that we define k-anonymity based
on the theoretically "complete" knowledge of the external world. However, it 
seems to be common in the literature that anonymity is defined based on the 
possible knowledge of the attackers. By definition, any knowledge an attacker 
has may very well be a part of the complete knowledge of a closed world. 

The question arises as to whether we may define k-anonymity based on the
partial knowledge that an attacker has of the closed world. Two scenarios
may be considered. In the first scenario, the attacker does not know all the
individuals that a tuple can re-identify. That is, for example, given a ZIP code
22032, the attacker only knows a subset of the individuals who reside in the
area determined by this ZIP code. In this scenario, a tuple in $\pi_{PAttr[T]}(T)$
may allow the attacker to re-identify only a proper subset of the k people that can
be re-identified by using the closed world. We should not use such partial
knowledge for k-anonymity for two reasons. (1) The attacker will gain false
information in the sense that he/she thinks that the individual is among fewer
individuals than there actually are. (2) If we need to be concerned with partial
knowledge, then we need to be concerned with all possible partial knowledge. Under
some particular partial knowledge, the attacker may always re-identify a single
individual with any tuple. Then we would not be able to achieve k-anonymity at all.
For these two reasons, we should retain our closed world assumption for this
scenario.

The other scenario is that the attacker either knows all the individuals that
can be re-identified with a tuple by the closed world, or he/she does not know
anyone. For example, given a ZIP code 22032, the attacker either knows all the
individuals living in ZIP 22032, or he/she does not have any clue who might live
in the area. This scenario is easier to deal with, by simply removing from
consideration the tuples in T about which the attacker knows nothing. However, in
order to be conservative, we probably do not want to do that, and again we come
back to the conclusion that we need the closed world assumption.
Finally, the fact that a world is closed does not necessarily mean that it has
the complete knowledge of all the individuals of the whole universe and all
their attributes. We only need the closed world to have the complete knowledge
about the attributes and the individuals that T is concerned with. For example,
if T only has attributes $A_1, \ldots, A_q$ and only concerns residents in the
state of Virginia, then the closed world will only need to have the complete
knowledge of these attributes and the Virginia residents.

4 Quasi-identifiers and k-anonymity

In order to understand the relationship between the notion of QI and k-anonymity,
we formally define QI, or more precisely k-QI, where $k \geq 1$ is an integer. We
then provide a sufficient and necessary condition for k-anonymity based on these
notions. Intuitively, a set of attributes is a k-QI of a world if a certain
combination of values for these attributes can only be found in no more than k
individuals of W, i.e., if that combination identifies a group of no more than k
individuals.

Definition 5. Given a world W and positive integer k, an attribute set
$A \subseteq Attr[W]$ is said to be a k-QI of W if there exists a tuple t on A
such that $0 < |ReID_W(t)| \leq k$.

For example, in the relational world W in Figure 1(a), ZIP is a 2-QI,
FirstName is a 1-QI, and the (FirstName, ZIP) combination is a 1-QI.

Clearly, each set of attributes $A \subseteq Attr[W]$ is a k-QI for some k for a
given finite world W.
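For a relational world, the smallest k for which an attribute set is a k-QI is simply the size of its smallest non-empty group of individuals. The following Python sketch computes it on the world of Figure 1(a); the function names `smallest_qi_level` and `is_k_qi` are hypothetical and only serve this illustration.

```python
# The world W of Figure 1(a).
W = [
    {"ID": "Id1", "FirstName": "John", "ZIP": "20033"},
    {"ID": "Id2", "FirstName": "Jeanne", "ZIP": "20034"},
    {"ID": "Id3", "FirstName": "Jane", "ZIP": "20033"},
    {"ID": "Id4", "FirstName": "Jane", "ZIP": "20034"},
]

def smallest_qi_level(world, attrs, id_attr="ID"):
    """Smallest k such that attrs is a k-QI of the relational world."""
    groups = {}
    for row in world:
        # Group individuals by their combination of values on attrs.
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, set()).add(row[id_attr])
    # Only value combinations that occur in the world are considered,
    # so every group is non-empty, matching 0 < |ReID_W(t)|.
    return min(len(people) for people in groups.values())

def is_k_qi(world, attrs, k):
    """True if some tuple on attrs re-identifies between 1 and k individuals."""
    return smallest_qi_level(world, attrs) <= k
```

On this world, ZIP yields 2, while FirstName and (FirstName, ZIP) both yield 1, matching the example above.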

Note that the notion of QI formalized in [11] and informally defined in other
works is captured by our definition of 1-QI. Indeed, assume some values of
QI uniquely identify individuals using external information. That is, if external
information is represented by a world W, QI is any set of attributes
$A \subseteq Attr[W]$ such that $|ReID_W(t)| = 1$ for at least one tuple t on A.
It can be easily seen that this is equivalent to the notion of 1-QI of W.

Proposition 1. If a set of attributes is a k-QI, then it is an s-QI for each
$s \geq k$.

Thus, we know that each 1-QI is a k-QI for $k \geq 2$. It is clear that the
converse does not hold, i.e., if $k \geq 2$ there exist k-QIs that are not 1-QIs.
For example, ZIP in the world W of Figure 1(a) is a 2-QI, but not a 1-QI.

Definition 6. A set A of attributes is said to be a proper k-QI if it is a k-QI 
but it is not an s-QI for any s < k. 

The following result follows directly from the supertuple inclusion property of
the worlds:

Proposition 2. If a set A of attributes is a k-QI, then any $A' \supseteq A$ is a
k-QI.

Note that the special case of Proposition 2 for 1-QI has been independently
proved in [5].

The following sufficient condition for k-anonymity says that if the full set
of attributes appearing in external sources is a proper s-QI, then the table is
k-anonymous for each $k \leq s$.

Theorem 1. A table T is k-anonymous with respect to a world W if PAttr[T]
is a proper s-QI in W with $k \leq s$.

The above theorem holds because by definition, an attribute set
$A \subseteq Attr[W]$ is a proper k-QI if for each tuple t on A either
$|ReID_W(t)| = 0$ or $|ReID_W(t)| \geq k$. Hence, if PAttr[T] is a proper s-QI
we know that for each tuple t on PAttr[T], we have either $|ReID_W(t)| = 0$ or
$|ReID_W(t)| \geq s$. As we have always assumed that W is consistent with T, we
know $|ReID_W(t)| \geq s$. For k-anonymity, it is enough that we have $s \geq k$.

By the above theorem, if the general constraints on the external world ensure
that PAttr[T] is a proper s-QI with $s \geq k$, then there is no need to anonymize
table T if k-anonymity is the goal.

Now we can state the relationship between the k-anonymity notion and the
k-QI notion.

Theorem 2. A table T is k-anonymous with respect to a world W if and only if
for each k-QI A of W, with $A \subseteq PAttr[T]$, we have
$|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq k$ for each tuple $t \in \pi_A(T)$.

Proof. The "if" part: Assume there is such a k-QI A. By Proposition 2, we
know PAttr[T] is a k-QI. By hypothesis and the definition of k-anonymity, we
know T is k-anonymous. If there is no such k-QI, then PAttr[T] must be a
proper s-QI with $s > k$. In this case, by Theorem 1, T is k-anonymous.

The "only if" part: Assume T is k-anonymous. By the assumption that
T is properly formed, k-anonymity of T leads to
$|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq k$ for each tuple
$t \in \pi_{PAttr[T]}(T)$. By the supertuple inclusion property of the
world W, we know $|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq k$ for each tuple
$t \in \pi_A(T)$ and each attribute set $A \subseteq PAttr[T]$ (and hence for each
$A \subseteq PAttr[T]$ that is a k-QI). □

From the results of this section, we may draw the following observations and
conclusions. Given a table T, if any subset A of PAttr[T] is a k-QI, then PAttr[T]
itself is a k-QI. Hence, we need to make sure that the values on PAttr[T], not just
on a proper subset of PAttr[T], are general enough to gain k-anonymity. On the
other hand, if we have values on PAttr[T] general enough to have k-anonymity,
then the values on any proper subset of PAttr[T] will also be general enough
due to the supertuple inclusion property. Therefore, for k-anonymization, we
should only be concerned with the attribute set PAttr[T], not any proper subset
of it. In the next section, we shall show that, in fact, limiting the consideration
to any or all proper subsets of PAttr[T] will lead to privacy leaks.
5 Incorrect uses of QI in k-anonymization



As mentioned in the introduction, k-anonymity in a published table can be
obtained by generalizing the values of QI in the table. This process is called
k-anonymization. We also noted there that at least four different uses
of QI in k-anonymization have appeared in the literature. In this section, we
point out the incorrectness of cases (2)-(4). We defer the study of case (1) to
the next section.

5.1 Use 1-QI only 

Firstly, we note that the use of 1-QI (e.g., the QI as defined in [11] and [6])
instead of k-QI in the definition of k-anonymity can lead to incorrect results.
Indeed, according to the current anonymization techniques, if an attribute
is not in any QI, then the attribute is not considered for k-anonymity or
k-anonymization (see Def. 2.2 in [6]).

However, if QI is taken as 1-QI as done in [11, 6], this is a mistake.

Consider the table T in Figure 1(b) for 3-anonymity. The public attribute
of T is ZIP only, which is not a QI (or 1-QI). If we only consider 1-QI for
table T, then we may incorrectly conclude that the table does not need any
generalization (on ZIP values) in order to protect privacy. However, we know
T is not 3-anonymous (but is 2-anonymous) against W in the same figure. In
order to achieve 3-anonymity, we will need to generalize the ZIP values in T.

Therefore, the k-anonymity requirements based only on 1-QI fail to protect
the anonymity of data when k > 2. We can correct this problem by considering
all k-QIs, not just 1-QIs.

5.2 Use a subset of PAttr[T] 

The public attributes of a table are given by PAttr[T]. A few papers seem to imply
that only a subset of PAttr[T] needs to be considered. For example, [5, 4] define
QI as the minimum subset of PAttr[T] that can be used to identify individuals,
and [12] proposes to generalize all such minimum QIs. Even if we take QI as
k-QI, the use of the minimum subset is incorrect. We have the following important
result.

Theorem 3. Given an arbitrary T, an integer $k \geq 2$, and a world W, the fact
that $\pi_B(T)$ is k-anonymous for each proper subset B of PAttr[T] does not imply
that T is k-anonymous.

Proof. We prove the statement by showing that there exist a table T and a world
W, accompanied by the decoding function Dec(), with $Attr[W] = PAttr[T] \cup \{ID\}$,
such that T is not k-anonymous in W but each projection on a proper
subset of PAttr[T] is k-anonymous in W. Furthermore, in this world W, each
subset A of Attr[W] is a 1-QI.

Let PAttr[T] $= A_1, \ldots, A_n$ and $Attr[W] = PAttr[T] \cup \{ID\}$. For each
$i = 1, \ldots, n$, let $Dom(A_i) = \{a_{i,1}, \ldots, a_{i,k}\}$. Now let
$T = Dom(A_1) \times \cdots \times Dom(A_n)$.
The number of tuples in T is $k^n$, and assume we give each tuple a unique ID
value from the set $\{1, \ldots, k^n\}$. For each tuple $(a_{1,i_1}, \ldots, a_{n,i_n})$ with tuple ID
r, we generate the tuple $(a_{1,i_1,r}, \ldots, a_{n,i_n,r}, r)$ for W and let W consist of all such
tuples (and thus it has $k^n$ tuples). We assume that the decoding function works
as follows: $Dec(a_{j,i}) = \{a_{j,i,r} \mid r = 1, \ldots, k^n\}$.

It is clear that in W as constructed above, each subset of Attr[W] is a 1-QI
since each value only appears once. We now show that T is not k-anonymous
while $\pi_V(T)$ is k-anonymous for each proper subset V of PAttr[T], which
proves the theorem.

We first show that T is not k-anonymous. Pick an arbitrary
$t = (a_{1,i_1}, \ldots, a_{n,i_n})$ in T, and assume its ID is r. Then
$(a_{1,i_1,r}, \ldots, a_{n,i_n,r})$ appears in $\pi_{PAttr[T]}(W)$.
By construction of W, there are no other tuples of the form
$(a_{1,i_1,r'}, \ldots, a_{n,i_n,r'})$ in W with $r' \neq r$. Hence, T is not k-anonymous.

Now consider a proper subset B of PAttr[T] with $B = A_1, \ldots, A_p$ and
$p < n$. Note that this represents an arbitrary subset due to the symmetry of
the attributes in T. Take a tuple $t = (a_{1,i_1}, \ldots, a_{p,i_p}) \in \pi_B(T)$. Because of
the construction of W, we have $(a_{1,i_1,r}, \ldots, a_{p,i_p,r})$ in $\pi_B(W)$ for $k^{n-p}$ different
r values since t appears in $k^{n-p}$ tuples in T. It follows that $\pi_B(T)$ is
k-anonymous since $n > p$ and $(a_{1,i_1,r}, \ldots, a_{p,i_p,r})$ is in Dec(t) for $k^{n-p}$ different
r values. □
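
The construction used in the proof can be checked mechanically for small parameters. The following Python sketch is only an illustration under stated assumptions and is not part of the original note: the function names (build_table_and_world, reid_count, is_k_anonymous) are hypothetical, a value $a_{j,i}$ is encoded as the pair (j, i), a world value $a_{j,i,r}$ as the triple (j, i, r), and a tuple is considered re-identified with an individual when some decoding of it occurs in W.

```python
from itertools import combinations, product


def build_table_and_world(n, k):
    """Build T = Dom(A_1) x ... x Dom(A_n) and the world W from the proof.

    The value a_{j,i} is encoded as (j, i) and a_{j,i,r} as (j, i, r).
    """
    domains = [[(j, i) for i in range(k)] for j in range(n)]
    table = list(product(*domains))  # k**n tuples
    # The tuple with ID r contributes (a_{1,i_1,r}, ..., a_{n,i_n,r}, r) to W.
    world = [tuple((j, i, r) for (j, i) in t) + (r,)
             for r, t in enumerate(table)]
    return table, world


def reid_count(t_proj, attrs, world):
    """Count the individuals (IDs) in W that match some decoding of t_proj.

    Since Dec(a_{j,i}) = {a_{j,i,r}}, a world tuple matches when it agrees
    with t_proj on (j, i) for every attribute position in attrs.
    """
    ids = {w[-1] for w in world
           if all(w[j][:2] == v for j, v in zip(attrs, t_proj))}
    return len(ids)


def is_k_anonymous(table, attrs, world, k):
    """Check that every projected tuple matches at least k individuals."""
    return all(reid_count(tuple(t[j] for j in attrs), attrs, world) >= k
               for t in table)


n, k = 3, 2
T, W = build_table_and_world(n, k)
full = tuple(range(n))
proper_subsets = [B for p in range(1, n) for B in combinations(full, p)]
print(is_k_anonymous(T, full, W, k))                            # whole table
print(all(is_k_anonymous(T, B, W, k) for B in proper_subsets))  # projections
```

For n = 3 and k = 2 the first check prints False and the second prints True, in line with the counts 1 and $k^{n-p}$ derived in the proof.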



By Theorem 3, we understand that we cannot simply apply generalization
techniques on a proper subset of the attributes in PAttr[T]. As an example, consider
table T and its generalized version T' in Figure 2. Attribute ID is a 1-QI, while
ZIP is not a 1-QI (however, the combination of ID and ZIP is). To generalize
the minimum 1-QI, we would probably generalize table T to T' to make sure
there are two appearances of each (generalized) ID value. However, it is clear
that T' does not provide 2-anonymity in the world W given in the same figure.



ID     Name      ZIP
Id1    John      20033
Id2    Jeanne    20034
Id3    Jane      20033
Id4    Jane      20034

(a) The world W.

ID     ZIP      Disease
Id1    20033    D1
Id2    20034    D2
Id3    20033    D3
Id4    20034    D4

(b) Original table T.

ID           ZIP      Disease
[Id1-Id2]    20033    D1
[Id1-Id2]    20034    D2
[Id3-Id4]    20033    D3
[Id3-Id4]    20034    D4

(c) T' with generalized ID values.

Figure 2: Example without proper generalization
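
The claim that T' in Figure 2 does not provide 2-anonymity can be verified by direct counting. The sketch below is illustrative only: the data are transcribed from the figure, while the decoding of the generalized ID values (DEC_ID) and the helper name reidentified are assumptions introduced here.

```python
# World W and generalized table T' transcribed from Figure 2.
WORLD = [
    ("Id1", "John", "20033"),
    ("Id2", "Jeanne", "20034"),
    ("Id3", "Jane", "20033"),
    ("Id4", "Jane", "20034"),
]
T_PRIME = [
    ("[Id1-Id2]", "20033", "D1"),
    ("[Id1-Id2]", "20034", "D2"),
    ("[Id3-Id4]", "20033", "D3"),
    ("[Id3-Id4]", "20034", "D4"),
]

# Assumed decoding of the generalized ID values (not spelled out in the figure).
DEC_ID = {
    "[Id1-Id2]": {"Id1", "Id2"},
    "[Id3-Id4]": {"Id3", "Id4"},
}


def reidentified(gen_id, zip_code=None):
    """Return the IDs in W compatible with a generalized ID and, optionally, a ZIP."""
    return {wid for (wid, _name, wzip) in WORLD
            if wid in DEC_ID[gen_id] and (zip_code is None or wzip == zip_code)}


id_only = [len(reidentified(gid)) for (gid, _zip, _d) in T_PRIME]
id_and_zip = [len(reidentified(gid, z)) for (gid, z, _d) in T_PRIME]
print(id_only)     # candidates per tuple when only ID is considered
print(id_and_zip)  # candidates per tuple when ID and ZIP are both used
```

Considering ID alone leaves two candidates per tuple, but combining the generalized ID with the ungeneralized ZIP leaves exactly one, so each tuple of T' is re-identified in W.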

6 Conservativeness of previous approaches



In practical scenarios, we do not know exactly what the world W is. In such
scenarios, we may want to define k-anonymity referring to all "possible" worlds,
to guarantee "conservative" k-anonymity. Indeed, this is the view taken by
[10] and other researchers. In this section, we provide a formal correctness
proof of a common practice of guaranteeing "conservative" k-anonymity. (Here,
"conservative" means "we would rather err on the side of overprotection".)

The common practice we refer to is the following. Given a relational table T,
assume each tuple contains information about a single, different individual, and
assume that the public attributes that can be used to identify the individuals
in T are PAttr[T]. Then T is k-anonymous if for each tuple t in T, the value
t[PAttr[T]] appears in at least k tuples in T. (Note that in the literature, the
attributes PAttr[T] above are replaced with the "QI attributes", which would be
a mistake if "QI attributes" do not mean PAttr[T], as shown in the previous
section.)
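
The common practice amounts to a simple counting test on the released table. A minimal Python sketch follows, assuming rows are represented as dictionaries; the function name satisfies_common_practice and the sample table are hypothetical and serve only as illustration.

```python
from collections import Counter


def satisfies_common_practice(table, public_attrs, k):
    """Return True if every combination of public-attribute values
    that occurs in the table occurs in at least k tuples."""
    counts = Counter(tuple(row[a] for a in public_attrs) for row in table)
    return all(c >= k for c in counts.values())


# Hypothetical released table with generalized ZIP values.
released = [
    {"ZIP": "2003*", "Disease": "D1"},
    {"ZIP": "2003*", "Disease": "D2"},
    {"ZIP": "2003*", "Disease": "D3"},
]
print(satisfies_common_practice(released, ["ZIP"], 3))
print(satisfies_common_practice(released, ["ZIP"], 4))
```

Note that public_attrs must list all of PAttr[T]; passing only a proper subset is exactly the mistake discussed in the previous section.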

In contrast to the definition of k-anonymity of this short note, with this
common practice, no external world is mentioned. We shall show below that,
in fact, this common practice provides k-anonymity in a rather "conservative"
sense with respect to all "possible" worlds.

We observe that, in contrast to what we have so far, the table T in the
common practice has an additional assumption that each tuple of T is for a
different individual. Therefore, the requirement of a consistent world for such a
table needs to be strengthened. Earlier, we only needed a consistent world to be able
to re-identify each tuple in $\pi_{PAttr[T]}(T)$ with at least one individual. Here, since
each tuple of T is assumed to be for a different individual, a consistent world
must be able to re-identify each tuple in $\pi_{PAttr[T]}(T)$ with a different individual.

Definition 7. A world W is said to be individualized consistent with a table T
with n tuples if there exist n individuals $i_1, \ldots, i_n$ such that there exists
$T' = t'_1, \ldots, t'_n$ in Dec(T) satisfying the condition that $i_j$ is in
$ReID_W(\pi_{PAttr[T]}(t'_j))$ for each $j = 1, \ldots, n$.

Intuitively, this means that T could be generalized from a table T' such that 
each tuple may be used to re-identify a different individual by W. 
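
Under one reading of Definition 7, checking individualized consistency is a bipartite matching problem: each tuple must be assigned its own individual. The sketch below rests on assumptions that go beyond the text: the n individuals are taken to be pairwise distinct, the tuples of T are decoded independently, and candidates[j] is assumed to already hold every individual that some decoding of tuple j re-identifies in W. The function name individualized_consistent is hypothetical, and the matching routine is the standard augmenting-path method rather than anything prescribed by the note.

```python
def individualized_consistent(candidates):
    """Decide whether every tuple can be given a distinct individual.

    candidates[j] is the set of individuals that tuple j could be
    re-identified with (assumed to be precomputed from Dec and W).
    """
    owner = {}  # individual -> index of the tuple currently assigned to it

    def try_assign(j, seen):
        # Try to give tuple j an individual, reassigning others if needed.
        for ind in candidates[j]:
            if ind in seen:
                continue
            seen.add(ind)
            if ind not in owner or try_assign(owner[ind], seen):
                owner[ind] = j
                return True
        return False

    return all(try_assign(j, set()) for j in range(len(candidates)))


# Hypothetical candidate sets for two three-tuple tables.
print(individualized_consistent([{"p1", "p2"}, {"p1"}, {"p2", "p3"}]))
print(individualized_consistent([{"p1"}, {"p1"}, {"p2", "p3"}]))
```

The first call prints True, since the three tuples can be matched to p2, p1 and p3. The second prints False, because two tuples compete for the single individual p1, which is the kind of "impossible" world the definition rules out.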

The fact that a world W is individualized consistent with a table T basically
confirms the assumption that each tuple of T can indeed re-identify a different
individual. All other worlds are "impossible" for table T since the
assumption that different tuples of T are for different individuals cannot hold with
such worlds. We can now capture the notion of conservative anonymity for such
tables.

Definition 8. A table T is said to be conservatively k-anonymous if it is
k-anonymous with respect to each W that is individualized consistent with T.

We use the term "conservative" also to indicate the fact that we do not use
any knowledge of the world, even if we have any, when k-anonymity is considered.
In the "practical consideration" part of Section 3, we had an example
where knowledge of the world may be used.

We are now ready to show that the common practice described earlier is
correct, if PAttr[T] is taken as QI for a given table T.

Theorem 4. Let T be a table such that there exists a world that is individualized
consistent with T. Then T is conservatively k-anonymous if for each tuple t in
T, there exist at least $k - 1$ other tuples $t_1, \ldots, t_{k-1}$ in T such that
$t_i[PAttr[T]] = t[PAttr[T]]$ for $i = 1, \ldots, k-1$.

Proof. Let W be a world that is individualized consistent with T, and t a tuple in T. By
hypothesis, there exist k tuples $t_1, \ldots, t_k$ in T (which may include t itself) such that
$t_j[PAttr[T]] = t[PAttr[T]]$, $j = 1, \ldots, k$. By definition of W, there exist k
individuals $i_1, \ldots, i_k$ such that $i_j$ is in $ReID_W(t'_j)$, where $t'_j$ is in $Dec(t_j)$, for
$j = 1, \ldots, k$. Thus, $|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq |ReID_W(t'_1) \cup \cdots \cup ReID_W(t'_k)| \geq k$.
Hence, T is k-anonymous wrt W. □

Theorem 4 shows that in general, if PAttr[T] is taken as the QI, the common
(conservative) process of anonymization that appears in the literature is sufficient
under the assumption that we have no knowledge of the world.

The inverse of Theorem 4 does not hold. Indeed, consider T' in Figure 1(c).
If the Dec() function is such that Dec(J*) = {Jane} and Dec(Jane) = {Jane},
then it is clear that T' is 3-anonymous with respect to all worlds that are
consistent with T' because in any of these worlds, there must be at least 3
individuals with the first name Jane. However, in T' we do not have three
tuples with the same first name attribute values.

The above example may be dismissed as using a strange decoding function.
However, for any Dec(), we can always construct a table T such that the inverse
of Theorem 4 does not hold. Formally,

Theorem 5. For any decoding function, the inverse of Theorem 4 does not
hold.

Proof. We only sketch the idea of constructing a counterexample table T for
the inverse of Theorem 4. First obtain an arbitrary tuple t in T of an arbitrary
schema such that $PAttr[T] \neq \emptyset$. We can make such a tuple t satisfy
the conditions $|Dec(t[PAttr[T]])| > 1$ and $t[PAttr[T]]$ is not in $Dec(t[PAttr[T]])$
(otherwise, there is no real generalization going on). Now for each (and every)
tuple t' in $Dec(t[PAttr[T]])$, we generate a tuple t'' for table T such that
$t''[PAttr[T]] = t'[PAttr[T]]$. We duplicate this t'' in T k times. Since we
assumed $t[PAttr[T]]$ is not in $Dec(t[PAttr[T]])$, we know that the condition of
Theorem 4 is not satisfied for t. However, for any world W that is consistent
with T, there will be at least k individuals in $ReID_W(t[PAttr[T]])$. This is
because there must be $Dec(t') = \{t'\}$ for each t' given above, and in this world
W, $|ReID_W(t')| \geq k$ for each t' in $Dec(t[PAttr[T]])$ by definition of consistent
worlds and the fact that t' appears k times in $\pi_{PAttr[T]}(T)$. Hence, tuple t
satisfies the k-anonymity condition, although $t[PAttr[T]]$ only appears once in
T. □

In Theorem 3, we showed that k-anonymization of any or all proper subsets
of PAttr[T] is no guarantee of obtaining k-anonymity in T. We may extend the
result to the conservative case.

Theorem 6. Given an arbitrary T and integer $k \geq 2$, the fact that $\pi_B(T)$ is
conservatively k-anonymous for each proper subset B of PAttr[T] does not imply
that T is conservatively k-anonymous.

Proof. We can simply use the table T constructed in the proof of Theorem 3. It
is easily seen that $\pi_B(T)$ is conservatively k-anonymous for each proper subset
B of $\{A_1, \ldots, A_n\}$ due to Theorem 4 and the fact that each tuple t[B] appears
at least k times (as shown in the proof of Theorem 3). To show that T is
not conservatively k-anonymous, we only need to construct one world W that is
consistent with T such that T is not k-anonymous with respect to W. In fact, the
same W constructed in the proof of Theorem 3 is easily seen to be consistent with
T, and we have shown there that T is not k-anonymous with respect to that
W. □

As a final remark of this section, we note that if we do have some knowledge 
about the world and the Dec() function, we can in some cases do better than 
this conservative approach. For example, for table T' in Figure 1(c), if we know
that Dec(J*) includes Jane, and there are more than 3 Janes in the world, then
T' is 3-anonymous for T' . Without such assumptions, we will have to generalize 
Jane to J* in order to achieve 3-anonymity. The investigation of how to take 
advantage of such knowledge in anonymization is beyond the scope of this short 
note. 



7 Discussion and Conclusion 

In summary, we have formally analyzed the notion of quasi-identifier, as it is
essential to understanding the semantics of k-anonymity. We have shown that the
current formal definitions of QI are not satisfactory and that any approach based
on these definitions may lead to re-identification (i.e., privacy leakage), as this
formally defined QI corresponds to 1-QI as defined in this paper. We also
showed the problems with other definitions of QI. We have also formally proved
the correctness of using all attributes that appear in the external world as QI, and
pointed out precisely what conservative assumptions are made along the way.

We have provided a new formal framework for k-anonymity that, by clarifying
the role of quasi-identifiers, allows the designers of anonymization techniques
to prove the formal properties of their solutions. The presented framework can
also serve as the basis for generalization methods with more relaxed, or different,
assumptions. Indeed, the new notion of k-anonymity enables improvements
when assumptions can be made on the external information sources, i.e., the
world W.

Note that all through this short note, we have used "individuals" as the
entities whose privacy needs to be protected. Obviously, any entities whose
privacy needs to be protected can be taken as the "individuals", and the notion
of k-anonymity and k-QI should carry over without change.

Finally, it should be mentioned that the assumption of having at most one
tuple for each individual in each relational world W can be removed (but
each W tuple is still assumed to be for only one individual) if we assume
a special attribute $Rid \in Attr[W]$ storing the unique id of the individual
for each W tuple. In this case, the cardinality of different tuples should be
checked on the (set-semantics) projection on this Rid attribute. For example,
in Def. 4 the formula $|\bigcup_{t' \in Dec(t)} ReID_W(t')| \geq k$ should be substituted with
$|\pi_{Rid}(\bigcup_{t' \in Dec(t)} \sigma_{PAttr[T]=t'}(W))| \geq k$. (Here, $|\cdot|$ counts the number of distinct
elements in a bag.)
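
The difference between counting matching tuples and counting distinct Rid values can be seen in a short Python sketch. It is illustrative only: the function name distinct_individuals, the dictionary representation of W, and the sample data are assumptions made here, and decodings stands for the set Dec(t) restricted to the public attributes.

```python
def distinct_individuals(world, public_attrs, decodings):
    """Count distinct Rid values among the W tuples whose public-attribute
    values equal some decoding t' of the released tuple."""
    rids = {w["Rid"] for w in world
            if tuple(w[a] for a in public_attrs) in decodings}
    return len(rids)


# Hypothetical world in which individual 1 owns two tuples.
world = [
    {"Rid": 1, "Name": "Jane", "ZIP": "20033"},
    {"Rid": 1, "Name": "Jane", "ZIP": "20034"},
    {"Rid": 2, "Name": "Jane", "ZIP": "20033"},
]
decodings = {("Jane",)}  # assumed Dec(t) for a released tuple, on Name only
matching = [w for w in world if (w["Name"],) in decodings]
print(len(matching), distinct_individuals(world, ["Name"], decodings))
```

Three tuples of W match, but they belong to only two individuals, so the released tuple would be 2-anonymous rather than 3-anonymous under the modified formula.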